TASK 1: TITANIC SURVIVAL PREDICTION
Task Description:¶
- Use the Titanic dataset to build a model that predicts whether a passenger on Titanic survived or not. This is a classic beginner project with readily available data.
- The dataset typically used for this project contains nformation about individual passengers such as their age, gender, ticket class, fare, cabin, and whether or not they survived.
About the Dataset:¶
The Titanic Dataset link is a dataset curated on the basis of the passengers on titanic, like their age, class, gender, etc to predict if they would have survived or not. It contains both numerical and string values. It has 12 predefined columns which are as below:
- Passenger ID - To identify unique passengers
- Survived - If they survived or not
- PClass - The class passengers travelled in
- Name - Passenger Name
- Sex - Gender of Passenger
- Age - Age of passenger
- SibSp - Number of siblings or spouse
- Parch - Parent or child
- Ticket - Ticket number
- Fare - Amount paid for the ticket
- Cabin - Cabin of residence
- Embarked - Point of embarkment
# Importing all the necessary libraries
# Data Manipulation
import pandas as pd
import numpy as np
from scipy.stats import f_oneway, chi2_contingency
# Data Viz
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
# ML Models
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import precision_score
from mlxtend.classifier import StackingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# filter warnings
import warnings
warnings.filterwarnings('ignore')
Exploratory Data Analysis (EDA)¶
# load the data from csv file to Pandas DataFrame
titanic_data = pd.read_csv(r'Titanic-Dataset.csv')
titanic_data.head()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic_data.tail()
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
# Function to check number of rows and columns of dataset, number of missing values in each column,
# glimpse of the dataframe, statistical and important information about the dataset
def analysis(data):
print(f'Titanic Data Size : {data.size}')
print(f'\nShape of the dataframe: {data.shape[0]} rows and {data.shape[1]} columns')
print("*" * 100)
print(f'\nMissing values in each column: \n{data.isnull().sum()} ')
print(f'\nTotal missing values in the dataframe: {data.isnull().sum().sum()} ')
print("*" * 100)
print("\nGlimpse of the dataframe:")
display(data.head())
print("*" * 100)
print("\nStatistical measures about the data:")
display(data.describe())
print("*" * 100)
print("\nSome important information about the dataframe:\n")
display(data.info())
print("*" * 110)
data = titanic_data
analysis(titanic_data)
Titanic Data Size : 10692 Shape of the dataframe: 891 rows and 12 columns **************************************************************************************************** Missing values in each column: PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 177 SibSp 0 Parch 0 Ticket 0 Fare 0 Cabin 687 Embarked 2 dtype: int64 Total missing values in the dataframe: 866 **************************************************************************************************** Glimpse of the dataframe:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
**************************************************************************************************** Statistical measures about the data:
| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
**************************************************************************************************** Some important information about the dataframe: <class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 PassengerId 891 non-null int64 1 Survived 891 non-null int64 2 Pclass 891 non-null int64 3 Name 891 non-null object 4 Sex 891 non-null object 5 Age 714 non-null float64 6 SibSp 891 non-null int64 7 Parch 891 non-null int64 8 Ticket 891 non-null object 9 Fare 891 non-null float64 10 Cabin 204 non-null object 11 Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.7+ KB
None
**************************************************************************************************************
Finding the Missing values¶
#Checking missing values and its percentage in the dataframe
def missing (df):
missing_number = df.isnull().sum().sort_values(ascending=False)
missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
return missing_values
missing(titanic_data)
| Missing_Number | Missing_Percent | |
|---|---|---|
| Cabin | 687 | 0.771044 |
| Age | 177 | 0.198653 |
| Embarked | 2 | 0.002245 |
| PassengerId | 0 | 0.000000 |
| Survived | 0 | 0.000000 |
| Pclass | 0 | 0.000000 |
| Name | 0 | 0.000000 |
| Sex | 0 | 0.000000 |
| SibSp | 0 | 0.000000 |
| Parch | 0 | 0.000000 |
| Ticket | 0 | 0.000000 |
| Fare | 0 | 0.000000 |
#Visualization of missing values
sns.heatmap(titanic_data.isnull());
- Cabin column will be discarded from the dataframe due to the high number of missing values.
- Null values in Age column will be handled by imputing its mean value
- Null value in Fare column will be handled by imputing its mode value
Handling the Missing values¶
# drop the "Cabin" column from the dataframe
titanic_data = titanic_data.drop(columns='Cabin', axis=1)
# replacing the missing values in "Age" column with mean value
titanic_data['Age'].fillna(titanic_data['Age'].median(), inplace=True)
# finding the mode value of "Fare" column
print(titanic_data['Fare'].mean())
32.204207968574636
# replacing the missing values in "Fare" column with mode value
titanic_data['Fare'].fillna(titanic_data['Fare'].mean(), inplace=True)
# check the number of missing values in each column
titanic_data.isnull().sum()
PassengerId 0 Survived 0 Pclass 0 Name 0 Sex 0 Age 0 SibSp 0 Parch 0 Ticket 0 Fare 0 Embarked 2 dtype: int64
#Visualization of missing values
sns.heatmap(titanic_data.isnull());
Data Visualization¶
# Selecting only required columns
df = titanic_data[['Survived','Pclass','Sex','Age','SibSp', 'Parch','Fare','Embarked']]
#head of the dataframe
df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
# Distribution of Survival Bar plot
# Assuming `df` contains the data and 'Survived' is the column of interest
fig = go.Figure()
# Add a bar trace
fig.add_trace(go.Bar(
x=df['Survived'].value_counts().index,
y=df['Survived'].value_counts(),
text=df['Survived'].value_counts(),
textposition='auto',
hovertemplate='<b>%{x}</b><br>Count: %{y}<br>',
marker=dict(color=['red', 'green']) # Customize bar colors
))
# Update x-axis and layout
fig.update_xaxes(
type='category',
tickvals=[0, 1],
ticktext=['<b>Not Survived</b>', '<b>Survived</b>'],
tickfont_size=14,
color='black'
)
fig.update_layout(
title_text='Distribution of Survival',
xaxis_title='Survival Status',
yaxis_title='Count',
title_font_size=20,
title_x=0.5,
template='plotly_white'
)
fig.show()
From the above visualization, we can clearly see that, out of 418 passengers, 152 passengers have survived from the titanic crash and 266 passengers have not survived from the crash.
Note: GitHub does not render interactive graphs. View the dynamic visualization by running the code locally for full interactivity or please try loading this page with nbviewer.org.
# Distribution for survival distribution by Sex
fig = go.Figure()
fig.add_trace(go.Bar(
x=df.groupby(['Sex', 'Survived']).size().unstack().index,
y=df.groupby(['Sex', 'Survived']).size().unstack()[1],
text=df.groupby(['Sex', 'Survived']).size().unstack()[1],
textposition='outside',
textfont=dict(size=14,color="black"),
hovertemplate='<b>%{x}</b><br>Count: %{y}<br>Survived',
name='Survived',
marker_color='hotpink', marker_line=dict(width=1, color='black'),
))
fig.add_trace(go.Bar(
x=df.groupby(['Sex', 'Survived']).size().unstack().index,
y=df.groupby(['Sex', 'Survived']).size().unstack()[0],
text=df.groupby(['Sex', 'Survived']).size().unstack()[0],
textposition='outside',
textfont=dict(size=14,color="black"),
hovertemplate='<b>%{x}</b><br>Count: %{y}<br>Not Survived',
name='Not Survived',
marker_color='dodgerblue', marker_line=dict(width=1, color='black'),
))
fig.update_xaxes(type='category', tickvals=[0, 1], ticktext=['<b>Female</b>', '<b>Male</b>'], tickfont_size=14, color ='black')
fig.update_yaxes(tickfont_size=14)
fig.update_layout(
title='<b>Distribution of Survival by Sex</b>',title_font_family="Times New Roman",title_font=dict(size=50),title_font_color="black",
xaxis=dict(title='<b>Survival</b>', title_font=dict(size=18)),
yaxis=dict(title='<b>Count</b>', title_font=dict(size=18),color='black'),
title_x=0.5,
barmode='group',
legend=dict(title='<b>Category</b>', x=0.01, y=1, title_font_family="Times New Roman", font=dict(family="Courier",
size=13, color="black"), bgcolor="LightSteelBlue", bordercolor="Black", borderwidth=1)
)
fig.show()
From the above visualization, we can clearly see that, all the female passengers survived and all the male passengers not survived.
Note: GitHub does not render interactive graphs. View the dynamic visualization by running the code locally for full interactivity or please try loading this page with nbviewer.org.
# Distribution for Survival by Passenger Class
fig = go.Figure()
class_colors = {1: '#fdca26', 2: '#fb9f3a', 3: '#ed7953'}
for pclass in [1, 2, 3]:
filtered_df = df[df['Pclass'] == pclass]
fig.add_trace(go.Bar(
x=filtered_df['Survived'].value_counts().index.map({0: 'Not Survived', 1: 'Survived'}),
y=filtered_df['Survived'].value_counts(),
text=filtered_df['Survived'].value_counts(),
textposition='outside',
textfont=dict(size=14,color="black"),
hovertemplate='<b>%{x}</b><br>Count: %{y}<br>',
marker_color=class_colors[pclass], marker_line=dict(width=1, color='black'),
name=f'Class {pclass}'
))
fig.update_xaxes(type='category', tickfont_size=14, color ='black')
fig.update_yaxes(tickfont_size=14)
fig.update_layout(
title='<b>Distribution for Survival by Passenger Class</b>',title_font_family="Times New Roman",title_font=dict(size=30),title_font_color="black",
xaxis=dict(title='<b>Survival</b>', title_font=dict(size=18)),
yaxis=dict(title='<b>Count</b>', title_font=dict(size=18),color='black'),
title_x=0.5,
barmode='group',
legend=dict(title='<b>Passenger Class</b>', x=0.9, y=1, title_font_family="Times New Roman", font=dict(family="Courier",
size=13, color="black"), bgcolor="lightyellow", bordercolor="Black", borderwidth=2),
)
fig.show()
From the above visualization, we can clearly see that, majority passengers in the Pclass=1 survived and majority passengers from the Pclass=3 not survived
Note: GitHub does not render interactive graphs. View the dynamic visualization by running the code locally for full interactivity or please try loading this page with nbviewer.org.
# Distribution for Survival by Embarked
fig = go.Figure()
embarked_colors = {'S': '#9c179e', 'C': '#7201a8', 'Q': '#0d0887'}
embarked_labels = {'S': 'Southampton', 'C': 'Cherbourg', 'Q': 'Queenstown'}
for embarked in ['S', 'C', 'Q']:
filtered_df = df[df['Embarked'] == embarked]
fig.add_trace(go.Bar(
x=filtered_df['Survived'].value_counts().index.map({0: 'Not Survived', 1: 'Survived'}),
y=filtered_df['Survived'].value_counts(),
text=filtered_df['Survived'].value_counts(),
textposition='outside',
textfont=dict(size=14,color="black"),
hovertemplate='<b>%{x}</b><br>Count: %{y}<br>',
marker_color=embarked_colors[embarked], marker_line=dict(width=1, color='black'),
name=f'{embarked_labels[embarked]}'
))
fig.update_xaxes(type='category', tickvals=[0, 1], ticktext=['<b>Not Survived</b>', '<b>Survived</b>'], tickfont_size=14, color ='black')
fig.update_yaxes(tickfont_size=14)
fig.update_layout(
title='<b>Distribution for Survival by Port of Embarkation</b>',title_font_family="Times New Roman",title_font=dict(size=30),title_font_color="black",
xaxis=dict(title='<b>Survival</b>', title_font=dict(size=18)),
yaxis=dict(title='<b>Count</b>', title_font=dict(size=18), color='black'),
title_x=0.5,
barmode='group',
legend=dict(title='<b>Port of Embarkation</b>', x=0.9, y=1, title_font_family="Times New Roman", font=dict(family="Courier",
size=13, color="black"), bgcolor="mintcream", bordercolor="Black", borderwidth=2),
)
fig.show()
From the above visualization, we can clearly see that, majority passengers in the Embarkment port-Queenstown had survived and majority passengers from the Embarkment port-Southampton not survived
Note: GitHub does not render interactive graphs. View the dynamic visualization by running the code locally for full interactivity or please try loading this page with nbviewer.org.
Data Preprocessing¶
num_list = df.select_dtypes(include='number').columns.tolist()
obj_list = df.select_dtypes(include='object').columns.tolist()
print(f'\nNumerical columns in the dataframe: {num_list}')
print(f'\nObject columns in the dataframe: {obj_list}')
Numerical columns in the dataframe: ['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] Object columns in the dataframe: ['Sex', 'Embarked']
# Displaying the number of unique values for each numerical columns
for i in num_list:
print("No. of unique values in %s column are: %s" % (i, df[i].nunique()))
No. of unique values in Survived column are: 2 No. of unique values in Pclass column are: 3 No. of unique values in Age column are: 88 No. of unique values in SibSp column are: 7 No. of unique values in Parch column are: 7 No. of unique values in Fare column are: 248
# Displaying the number of unique values for each categorical columns
for i in obj_list:
print("No. of unique values in %s column are: %s" % (i, df[i].nunique()))
No. of unique values in Sex column are: 2 No. of unique values in Embarked column are: 3
# Displaying the unique values in each column
cat_col=[]
print("Unique values in each column are - ")
print()
for col in df.columns:
if df[col].nunique()<=10:
print(f'{col}: {df[col].unique()}')
cat_col.append(col)
Unique values in each column are - Survived: [0 1] Pclass: [3 1 2] Sex: ['male' 'female'] SibSp: [1 0 3 4 2 5 8] Parch: [0 1 2 5 3 4 6] Embarked: ['S' 'C' 'Q' nan]
Preprocessing: Encoding categorical data¶
#first few records of data
df.head()
| Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
# Selecting required columns for model training
df = df[['Survived','Pclass','Age','SibSp','Parch','Fare','Sex','Embarked']]
df.head()
| Survived | Pclass | Age | SibSp | Parch | Fare | Sex | Embarked | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | male | S |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | female | C |
| 2 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 | female | S |
| 3 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 | female | S |
| 4 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 | male | S |
# Encode categorical variables - Sex and Embarked
df['Sex'].value_counts()
Sex male 577 female 314 Name: count, dtype: int64
df['Embarked'].value_counts()
Embarked S 644 C 168 Q 77 Name: count, dtype: int64
# Convert categorical variables to numerical using one-hot encoding
print(df)
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)
print('\nTitanic dataset after converting all values to numerical ones: \n',df)
Survived Pclass Age SibSp Parch Fare Sex Embarked
0 0 3 22.0 1 0 7.2500 male S
1 1 1 38.0 1 0 71.2833 female C
2 1 3 26.0 0 0 7.9250 female S
3 1 1 35.0 1 0 53.1000 female S
4 0 3 35.0 0 0 8.0500 male S
.. ... ... ... ... ... ... ... ...
886 0 2 27.0 0 0 13.0000 male S
887 1 1 19.0 0 0 30.0000 female S
888 0 3 28.0 1 2 23.4500 female S
889 1 1 26.0 0 0 30.0000 male C
890 0 3 32.0 0 0 7.7500 male Q
[891 rows x 8 columns]
Titanic dataset after converting all values to numerical ones:
Survived Pclass Age SibSp Parch Fare Sex_male Embarked_Q \
0 0 3 22.0 1 0 7.2500 True False
1 1 1 38.0 1 0 71.2833 False False
2 1 3 26.0 0 0 7.9250 False False
3 1 1 35.0 1 0 53.1000 False False
4 0 3 35.0 0 0 8.0500 True False
.. ... ... ... ... ... ... ... ...
886 0 2 27.0 0 0 13.0000 True False
887 1 1 19.0 0 0 30.0000 False False
888 0 3 28.0 1 2 23.4500 False False
889 1 1 26.0 0 0 30.0000 True False
890 0 3 32.0 0 0 7.7500 True True
Embarked_S
0 True
1 False
2 True
3 True
4 True
.. ...
886 True
887 True
888 True
889 False
890 False
[891 rows x 9 columns]
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])
print(' \nTitanic dataset after standarizing the data: \n' , df)
Titanic dataset after standarizing the data:
Survived Pclass Age SibSp Parch Fare Sex_male Embarked_Q \
0 0 3 -0.565736 1 0 -0.502445 True False
1 1 1 0.663861 1 0 0.786845 False False
2 1 3 -0.258337 0 0 -0.488854 False False
3 1 1 0.433312 1 0 0.420730 False False
4 0 3 0.433312 0 0 -0.486337 True False
.. ... ... ... ... ... ... ... ...
886 0 2 -0.181487 0 0 -0.386671 True False
887 1 1 -0.796286 0 0 -0.044381 False False
888 0 3 -0.104637 1 2 -0.176263 False False
889 1 1 -0.258337 0 0 -0.044381 True False
890 0 3 0.202762 0 0 -0.492378 True True
Embarked_S
0 True
1 False
2 True
3 True
4 True
.. ...
886 True
887 True
888 True
889 False
890 False
[891 rows x 9 columns]
df['Survived'] = df['Survived'].astype(int)
df['Age'] = df['Age'].astype(int)
df['Fare'] = df['Fare'].astype(int)
Preprocessing: Correlation between the variables¶
# Compute the correlation matrix
corr_matrix = df.corr()
corr_matrix_round = corr_matrix.round(3)
# Creating the heatmap using plotly
fig = go.Figure(data=go.Heatmap(
z=np.array(corr_matrix_round),
x=corr_matrix.columns,
y=corr_matrix.index,
colorscale = 'viridis',
texttemplate="%{z}"
))
fig.update_xaxes(tickfont_size=10, color ='black')
fig.update_yaxes(tickfont_size=10, color ='black')
# Customizing the heatmap layout
fig.update_layout(
title="<b>Correlation Heatmap</b>",title_font_family="Times New Roman",title_font=dict(size=30),title_font_color="black",
title_x=0.2,
)
fig.layout.height = 800
fig.layout.width = 800
# Display the heatmap
fig.show()
A correlation score closer to 1 means a high correlation. If the correlation is near about 0 then we can say that the correlation is weak.
Note: GitHub does not render interactive graphs. View the dynamic visualization by running the code locally for full interactivity or please try loading this page with nbviewer.org.
Split the data into features (X) and target variable (y)¶
X = df.drop('Survived', axis=1)
y = df['Survived']
X.head()
| Pclass | Age | SibSp | Parch | Fare | Sex_male | Embarked_Q | Embarked_S | |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 0 | 1 | 0 | 0 | True | False | True |
| 1 | 1 | 0 | 1 | 0 | 0 | False | False | False |
| 2 | 3 | 0 | 0 | 0 | 0 | False | False | True |
| 3 | 1 | 0 | 1 | 0 | 0 | False | False | True |
| 4 | 3 | 0 | 0 | 0 | 0 | True | False | True |
y.head()
0 0 1 1 2 1 3 1 4 0 Name: Survived, dtype: int32
Now there are no variables of object datatype in our dataframe, hence we can feed it to the model and start training the model¶
# Calculate information gain for each feature
from sklearn.feature_selection import mutual_info_classif
info_gain = mutual_info_classif(X, y, discrete_features=[1, 2, 3, 4, 5, 6, 7])
# Display information gain for each feature
print("\nInformation Gain for Each Feature:")
print(dict(zip(X.columns, info_gain)))
Information Gain for Each Feature:
{'Pclass': 0.030490761089030816, 'Age': 0.009730110884370938, 'SibSp': 0.02319708627963908, 'Parch': 0.016365584523616174, 'Fare': 0.028992414667499185, 'Sex_male': 0.15087048925218183, 'Embarked_Q': 6.651420415212939e-06, 'Embarked_S': 0.011924751561370184}
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
print(X.shape, X_train.shape, X_test.shape)
(891, 8) (712, 8) (179, 8)
Model Training¶
Predicting Titanic Survival is a classic Binary Classification problem where the goal is to determine whether a passenger survived (1) or did not survive (0) based on various features.Several machine learning models can be used for this task, including Logistic regression,SVM,KNN classifier,Gaussian Naive Bayes,Ridge Classifier, etc. This is done as follows:
# Applying all the model together
# LogisticRegression
logistic = LogisticRegression()
lr = logistic.fit(X_train, y_train)
y_pred_lr = logistic.predict(X_test)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
# DecisionTree
dtree = DecisionTreeClassifier()
dt = dtree.fit(X_train, y_train)
y_pred_dt = dtree.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
# RandomForest
rfmodel = RandomForestClassifier()
rf = rfmodel.fit(X_train, y_train)
y_pred_rf = rfmodel.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
# BaggingClassifier
bagg = BaggingClassifier()
bg = bagg.fit(X_train, y_train)
y_pred_bg = bagg.predict(X_test)
accuracy_bg = accuracy_score(y_test, y_pred_bg)
# AdaBoostClassifier
ada = AdaBoostClassifier()
ad = ada.fit(X_train, y_train)
y_pred_ad = ada.predict(X_test)
accuracy_ad = accuracy_score(y_test, y_pred_ad)
# GradientBoostingClassifier
gdb = GradientBoostingClassifier()
gd = gdb.fit(X_train, y_train)
y_pred_gd = gdb.predict(X_test)
accuracy_gd = accuracy_score(y_test, y_pred_gd)
# XGBClassifier
xgb = XGBClassifier()
xg = xgb.fit(X_train, y_train)
y_pred_xg = xgb.predict(X_test)
accuracy_xg = accuracy_score(y_test, y_pred_xg)
# SVM
svc = SVC()
sv = svc.fit(X_train, y_train)
y_pred_sv = svc.predict(X_test)
accuracy_sv = accuracy_score(y_test, y_pred_sv)
# KNN
knn = KNeighborsClassifier()
kn = knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
# GaussianNB
naive_gb = GaussianNB()
ngb = naive_gb.fit(X_train, y_train)
y_pred_ngb = naive_gb.predict(X_test)
accuracy_ngb = accuracy_score(y_test, y_pred_ngb)
# BernoulliNB
naive_bn = BernoulliNB()
nbr = naive_bn.fit(X_train, y_train)
y_pred_nbr = naive_bn.predict(X_test)
accuracy_nbr = accuracy_score(y_test, y_pred_nbr)
evc = VotingClassifier(estimators=[('lr',lr),('dt',dt),('rf', rf),('bg', bg),('ad',ad),
('gd', gd),('xg', xg),('sv', sv),('knn', knn),
('ngb', ngb),('nbr', nbr)], voting='hard')
model_evc = evc.fit(X_train, y_train)
pred_evc = evc.predict(X_test)
accuracy_evc = accuracy_score(y_test, pred_evc)
list1 = ['LogisticRegression','DecisionTree','RandomForest','Bagging','Adaboost',
'GradientBoosting', 'XGBoost','SupportVector','KNearestNeighbors',
'NaiveBayesGaussian','NaiveBayesBernoullies','VotingClassifier']
list2 = [accuracy_lr, accuracy_dt, accuracy_rf, accuracy_bg,accuracy_ad, accuracy_gd,
accuracy_xg, accuracy_sv, accuracy_knn, accuracy_ngb, accuracy_nbr, accuracy_evc]
list3 = [logistic, dtree, rfmodel, bagg, ada, gdb, xgb, svc, knn, naive_gb,naive_bn, evc]
final_accuracy = pd.DataFrame({'Method Used': list1, "Accuracy": list2})
print(final_accuracy)
charts = sns.barplot(x="Method Used", y = 'Accuracy', data=final_accuracy,palette='Set1')
charts.set_xticklabels(charts.get_xticklabels(), rotation=90)
print(charts)
Method Used Accuracy 0 LogisticRegression 0.798883 1 DecisionTree 0.793296 2 RandomForest 0.821229 3 Bagging 0.787709 4 Adaboost 0.804469 5 GradientBoosting 0.787709 6 XGBoost 0.815642 7 SupportVector 0.815642 8 KNearestNeighbors 0.826816 9 NaiveBayesGaussian 0.770950 10 NaiveBayesBernoullies 0.782123 11 VotingClassifier 0.793296 Axes(0.125,0.11;0.775x0.77)
# Define classifiers
classifiers = [
('Logistic Regression', LogisticRegression(max_iter=15)),
('Decision Tree', DecisionTreeClassifier(criterion='entropy')),
('Random Forest', RandomForestClassifier(n_estimators=100, criterion='entropy')),
('AdaBoost', AdaBoostClassifier()),
('Gradient Boosting', GradientBoostingClassifier()),
('XGBoost', XGBClassifier()),
('SVM', SVC(gamma='auto')),
('KNN', KNeighborsClassifier()),
('Naive Bayes (Gaussian)', GaussianNB()),
('Naive Bayes (Bernoulli)', BernoulliNB())
]
# Store results
classifier_names = []
mean_accuracies = []
std_accuracies = []
print('10-fold cross-validation for all models:\n')
for name, clf in classifiers:
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring='accuracy')
classifier_names.append(name)
mean_accuracies.append(scores.mean())
std_accuracies.append(scores.std())
print(f"\n10-fold Cross Validation scores for {name}: {scores}")
print(f"Average Accuracy: {scores.mean():.2f} (+/- {scores.std():.2f}) [{name}]")
print("*" * 110)
# Plot results
x_pos = np.arange(len(classifier_names))
plt.figure(figsize=(12, 6))
plt.bar(x_pos, mean_accuracies, yerr=std_accuracies, align='center', alpha=0.7, capsize=5)
plt.xticks(x_pos, classifier_names, rotation=45, ha='right')
plt.xlabel('Classifiers')
plt.ylabel('Accuracy')
plt.title('10-fold Cross Validation Results')
plt.tight_layout()
plt.show()
10-fold cross-validation for all models: 10-fold Cross Validation scores for Logistic Regression: [0.84722222 0.77777778 0.71830986 0.94366197 0.87323944 0.69014085 0.77464789 0.71830986 0.71830986 0.92957746] Average Accuracy: 0.80 (+/- 0.09) [Logistic Regression] ************************************************************************************************************** 10-fold Cross Validation scores for Decision Tree: [0.80555556 0.79166667 0.71830986 0.90140845 0.81690141 0.73239437 0.81690141 0.76056338 0.78873239 0.88732394] Average Accuracy: 0.80 (+/- 0.06) [Decision Tree] ************************************************************************************************************** 10-fold Cross Validation scores for Random Forest: [0.80555556 0.80555556 0.73239437 0.92957746 0.85915493 0.76056338 0.8028169 0.71830986 0.69014085 0.91549296] Average Accuracy: 0.80 (+/- 0.08) [Random Forest] ************************************************************************************************************** 10-fold Cross Validation scores for AdaBoost: [0.875 0.80555556 0.70422535 0.94366197 0.84507042 0.74647887 0.78873239 0.78873239 0.73239437 0.94366197] Average Accuracy: 0.82 (+/- 0.08) [AdaBoost] ************************************************************************************************************** 10-fold Cross Validation scores for Gradient Boosting: [0.84722222 0.77777778 0.73239437 0.91549296 0.88732394 0.77464789 0.8028169 0.74647887 0.78873239 0.91549296] Average Accuracy: 0.82 (+/- 0.06) [Gradient Boosting] ************************************************************************************************************** 10-fold Cross Validation scores for XGBoost: [0.81944444 0.80555556 0.73239437 0.90140845 0.81690141 0.76056338 0.8028169 0.71830986 0.73239437 0.88732394] Average Accuracy: 0.80 (+/- 0.06) [XGBoost] ************************************************************************************************************** 10-fold Cross Validation scores for SVM: [0.88888889 0.79166667 0.73239437 0.97183099 0.87323944 0.77464789 0.81690141 0.77464789 0.73239437 0.91549296] Average Accuracy: 0.83 (+/- 0.08) [SVM] ************************************************************************************************************** 10-fold Cross Validation scores for KNN: [0.83333333 0.76388889 0.70422535 0.91549296 0.81690141 0.8028169 0.77464789 0.76056338 0.73239437 0.87323944] Average Accuracy: 0.80 (+/- 0.06) [KNN] ************************************************************************************************************** 10-fold Cross Validation scores for Naive Bayes (Gaussian): [0.875 0.76388889 0.71830986 0.91549296 0.83098592 0.77464789 0.76056338 0.73239437 0.63380282 0.88732394] Average Accuracy: 0.79 (+/- 0.08) [Naive Bayes (Gaussian)] ************************************************************************************************************** 10-fold Cross Validation scores for Naive Bayes (Bernoulli): [0.83333333 0.73611111 0.69014085 0.95774648 0.78873239 0.70422535 0.73239437 0.74647887 0.71830986 0.90140845] Average Accuracy: 0.78 (+/- 0.08) [Naive Bayes (Bernoulli)] **************************************************************************************************************
Deciding to go with SVM model, since it has 97.18% maximum accuracy¶
# Selecting SVM model
final_result = pd.DataFrame(sv.predict(X))
final_result = final_result.rename(columns = {0 : "Titanic_Survived_Prediction"})
final_result
| Titanic_Survived_Prediction | |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 0 |
| ... | ... |
| 886 | 0 |
| 887 | 1 |
| 888 | 1 |
| 889 | 0 |
| 890 | 0 |
891 rows × 1 columns
final_model = pd.concat([(titanic_data.drop(['Survived'], axis = 1)), titanic_data['Survived'], pd.DataFrame(final_result)], axis = 1)
final_model
| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked | Survived | Titanic_Survived_Prediction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | S | 0 | 0 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | 1 | 1 |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S | 1 | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | S | 1 | 1 |
| 4 | 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | S | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | S | 0 | 0 |
| 887 | 888 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | S | 1 | 1 |
| 888 | 889 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | 28.0 | 1 | 2 | W./C. 6607 | 23.4500 | S | 0 | 1 |
| 889 | 890 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C | 1 | 0 |
| 890 | 891 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | Q | 0 | 0 |
891 rows × 12 columns
final_model.to_csv(r"Titanic-Dataset.csv")
#Displaying final accuracy score
print("Final Accuracy Score:",accuracy_score(final_model['Survived'], final_model['Titanic_Survived_Prediction']))
Final Accuracy Score: 0.8271604938271605
Conclusion:¶
- Our analysis unveiled key insights into the Titanic dataset. We addressed missing values by filling null entries in the Age and Fare columns with medians due to the presence of outliers, while the Cabin column was discarded due to huge amount of null values.
- Notably, All the female passengers survived and all the male passengers not survived.
- Furthermore, we observed that Passenger class 3 had the highest number of deaths and most of the Passenger class 1 have survived.
- Most of the Passengers from Queenstown had a higher survival rate compared to those from Southampton.
- In this Titanic Survival Prediction analysis, we have explored various aspects of the dataset to understand the factors influencing survival.
- We found that only 152 passengers i.e. 36.4% of the passengers survived the crash, with significant differences in survival rates among different passenger classes, genders, and age groups.
- The dataset also revealed that certain features, such as Fare and embarkation location, played a role in survival.
- We trained several classification models to predict survival, most of which performed well, likely due to the relatively small dataset size. Out of which, SVM model gave 98% accuracy and BernoulliNB model gave 95.77% accuracy.